ABSTRACT

Automated test assembly is a technology for producing 
multiple, equivalent test forms from an item pool. An important consideration 
for test security in automated test assembly is the inclusion of the same 
items on these multiple forms. Although it is possible to use item selection 
as a formal constraint in assembling forms, the number of constraints is 
often so large to begin with that imposing additional constraints may produce 
unsatisfactory results. This paper proposes an alternative method for 
controlling item allocation that is based on randomization. An example from 
an actual item pool is presented to illustrate the method. Results show that 
it is possible to control the overall allocation of items across multiple 
test forms assembled through automated assembly methods using the same 
procedure that is used to control for item exposure in computerized adaptive 
testing situations. The iterative procedure was programmed directly into the 
form-assembly code, so that iterations become part of the assembly process. 
The goal is to produce the desired item allocation across forms, rather than 
to obtain exposure-control parameters for each item. (Author/SLD) 
Controlling Item Allocation in the Automated Assembly of Multiple Test Forms



Judith Spray 
Chuan-Ju Lin 
Troy T. Chen 



Abstract 



Automated test assembly is a technology for producing multiple, equivalent test forms 
from an item pool. An important consideration for test security in automated test assembly is the 
inclusion of the same items on these multiple forms. Although it is possible to use item selection 
as a formal constraint in assembling forms, the number of constraints is often so large to begin 
with that imposing additional constraints may produce unsatisfactory results. In this paper we 
propose an alternative method for controlling item allocation that is based on randomization. An 
example from an actual item pool is presented to illustrate the method. 

The automated assembly of multiple test forms for online delivery offers an alternative to 
a single, computer-administered, fixed test form or even a computerized-adaptive test. The 
constructed forms are usually assembled according to a set of content and psychometric 
specifications obtained from a reference test (i.e., a test form that has been administered 
previously and has exhibited acceptable results in terms of form difficulty, variability, reliability, 
passing rate or other psychometric considerations). If the constructed tests all meet these 
reference specifications, by making some assumptions concerning the operating characteristics 
of the items, the test forms can be thought of as equivalent in some sense. For example, if the 
psychometric specifications refer to the first and second moments of target difficulty and 
variability for each individual examinee, the constructed test forms would be parallel if all of the 
psychometric specifications were met across all of the test forms. The result is that a single 
passing standard or score could be used across forms, eliminating the need for post- 
administration equating or the establishment of separate passing scores for each form. 

The multiple forms may or may not consist of unique test items. Frequently, item pools 
from which the forms are constructed are small relative to the length and the number of forms 
required. Consequently, individual items may appear on more than one form. For example, if 
we were assembling five forms of the same test from a pool of items, each item within the pool 
would appear on either 0, 1, 2, 3, 4, or 5 forms. The number of items, $n_m$, that appear on 
$m$ = 0, 1, 2, 3, 4, 5 forms represents the allocation of items across the five test forms. We refer to 
the appearance of items across multiply constructed test forms as item allocation. 
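
The definition of item allocation can be made concrete with a small sketch. The function name, the integer item ids, and the toy forms below are illustrative assumptions, not part of the original study; the sketch simply counts, for each $m$, how many pool items appear on exactly $m$ forms.

```python
from collections import Counter

def item_allocation(forms, pool_size):
    """Return [n_0, n_1, ..., n_N]: number of pool items on exactly m forms."""
    n_forms = len(forms)
    # Count each item at most once per form.
    appearances = Counter(item for form in forms for item in set(form))
    allocation = [0] * (n_forms + 1)
    for item in range(pool_size):
        allocation[appearances.get(item, 0)] += 1
    return allocation

# Hypothetical toy data: a pool of 6 items (ids 0-5) and three 3-item forms.
toy_forms = [[0, 1, 2], [0, 1, 3], [0, 4, 3]]
toy_allocation = item_allocation(toy_forms, pool_size=6)
print(toy_allocation)
```

In this toy case one item is never used and one item appears on all three forms, which is the kind of allocation the method in this paper tries to avoid.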

If enough items appear frequently on many forms, the security of the items and the 
validity of the test results could be in question. One of the goals of the test assembly or 
construction process should be to minimize the test-overlap rate, defined as the proportion of items 
shared between any two forms. One way to do this is to include item usage as a constraint or 
target in the solution of the assembly problem. However, this may be unnecessary, especially if 
the form-assembly problem is burdened with numerous other constraints such as multiple levels 
of content categories and key balancing requirements, in addition to the psychometric 
requirements of the test forms. And any constraint that forces items onto a test form may end up 
doing so at the expense of other constraint goals. It may be more efficient to implement a 
simpler process to control the allocation of items across multiple forms. The purpose of this 
paper is to illustrate, by example, a simple randomization process that controls item allocation by 
minimizing the average test-overlap rate between pairs of test forms while producing tests that 
meet content and psychometric assembly constraints. 

Ideal Item Allocation across Multiple Test Forms 

What is the ideal distribution or allocation of test items across multiple, equivalent 
test forms constructed from the same item pool? From a test security standpoint, the most 
desirable allocation is one in which no items are shared across the 
forms. However, the item pool would have to be quite large relative to the length of each test 
form and the number of forms required to achieve this ideal. In addition, the pool would have to 
contain enough “good” items so that all of the psychometric constraints could be met. And 
if there were content constraints as well, there would have to be a sufficient number of 
items within each content category to satisfy the assembly goals. 

If such an ideal allocation cannot occur, one might ask what is “next-best”? From a test 
security perspective, we want to minimize the number of items that appear on every 
constructed form or nearly every constructed form. And from a test development perspective, we 
do not want a large proportion of the available items in the pool never to appear on 
a single form. The latter situation would be a waste of development time and money. 
To accomplish both goals, we present a method of controlling item appearances on multiple test 
forms that is derived from random sampling without replacement. This method can be 
implemented with any automated test-assembly procedure. It is based on the idea that if one 
could guarantee that the psychometric constraints would be met, the best way to safeguard 
against overexposure of items would be to select them from the pool or each content category at random 
without replacement. If this were possible, the resulting allocation of items across forms would 
be defined as optimal, in the sense that it minimizes the average test overlap of the constructed test 
forms. This claim is substantiated later in this paper. 

Traditional Method of Controlling Item Exposure in CAT 

Because the method of controlling item inclusion on assembled test forms is very similar 
to the traditional tactic used to manage computerized adaptive testing (CAT) programs, it is 
helpful to review that approach. The typical method of controlling for item exposure in CAT 
situations is to use a conditional approach first suggested by Sympson and Hetter (1985). For this 
procedure, a maximum expected item-exposure rate, r, is first established. The goal is to find a 
set of item-exposure-control parameters that govern the administration of items in a CAT item 
pool in such a way that no single item is ever administered more than $100r$% of the time, where 
0 < r < 1. 

The approach is called conditional because it is formulated within the context of a 
conditional probability statement. If $P_i(S)$ is the probability that item $i$ is selected for a CAT 
administration, and $P_i(S,A)$ is the probability that item $i$ is selected and administered (i.e., 
exposed), then an item’s exposure-control parameter is simply $P_i(A|S)$, the probability of 
administering an item, given that it has been selected, or $P_i(A|S) = P_i(S,A) / P_i(S)$. The purpose 
of this conditional probability is to allow a selected item to be administered only with that 
probability, thus controlling for the exposure of that item. 

If $P_i(S,A)$ is replaced by the target-exposure rate, $r$, CAT simulations and an iterative 
procedure are used to obtain a value of $P_i(A|S)$ for each item in the pool. Simulated examinees, 
similar in number and ability distribution to the intended CAT examinee population, are 
administered items selected from the CAT item pool. The values of $P_i(S)$ are usually all set to 1.0 
at the beginning of a set of simulations. The items are then selected on their ability to satisfy 
whatever constraints are required (e.g., maximum information at ability estimates, content 
specifications). However, they are only administered if a uniform random deviate is less than or 
equal to $r / P_i(S)$. If it is not, the items are temporarily set aside until all other items have been 
administered to a particular examinee or the pool has been exhausted. After all $N$ simulated 
examinees have taken the CAT, and the number of times each item has been selected, $S_i$, has been 
counted, $P_i(S)$ is replaced by $(S_i / N)$ and the process begins again. $P_i(S)$ continues to be refined 
until the proportion of times that an item has been selected and administered across 
all examinees, or $(A_i / N)$, is close to the target value $r$. The number of iterations of $P_i(S)$ 
required before $(A_i / N)$ approaches $r$ is usually fairly small (Sympson & Hetter, 1985). The 
result is that $P_i(A|S)$ stabilizes, subsequently to be used in real CAT administrations to control 
item usage or exposure at a rate $\le r$ across the examinee population. Obtaining $P_i(A|S)$ for each 
item in the pool is thus the goal of the simulation and iteration process for CAT. 
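
The simulate-and-iterate loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used by Sympson and Hetter or by the authors: item selection is replaced by a hypothetical noisy "quality" ranking, the function and variable names are invented for the example, and $P_i(S)$ is floored at $1/N$ so that $r / P_i(S)$ is always defined.

```python
import random

def sympson_hetter(n_items, test_len, n_examinees, r, n_iter, seed=0):
    """Toy version of the iterations; assumes n_iter >= 1 and n_items >= test_len."""
    rng = random.Random(seed)
    # Hypothetical item desirability: larger values are selected earlier.
    quality = [rng.random() for _ in range(n_items)]
    p_select = [1.0] * n_items              # P_i(S), initialized to 1.0
    for _iteration in range(n_iter):
        selected = [0] * n_items            # S_i
        administered = [0] * n_items        # A_i
        for _examinee in range(n_examinees):
            # A noisy ranking stands in for ability-dependent selection.
            noisy = [q + rng.gauss(0, 0.3) for q in quality]
            order = sorted(range(n_items), key=lambda i: noisy[i], reverse=True)
            given, set_aside = [], []
            for i in order:
                if len(given) == test_len:
                    break
                selected[i] += 1
                if rng.random() <= r / p_select[i]:
                    given.append(i)         # selected and administered
                else:
                    set_aside.append(i)     # selected but withheld for now
            # If the pool is exhausted, fall back on items set aside earlier.
            given.extend(set_aside[:test_len - len(given)])
            for i in given:
                administered[i] += 1
        # Replace P_i(S) by S_i / N (floored at 1/N, an assumption of this sketch).
        p_select = [max(s, 1) / n_examinees for s in selected]
        rates = [a / n_examinees for a in administered]
    exposure_params = [min(1.0, r / p) for p in p_select]   # P_i(A|S)
    return rates, exposure_params

rates, params = sympson_hetter(n_items=50, test_len=10, n_examinees=500,
                               r=0.3, n_iter=4)
print(round(sum(rates), 6), round(max(rates), 3))
```

Because every simulated examinee receives exactly `test_len` items, the observed rates always sum to the test length; only their spread changes from one iteration to the next.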

The number of times that an item has been administered or exposed, $A_i$, can be assumed 
to be a binomial random variable with parameters $P_i(S,A)$, abbreviated as simply $P_i$, and $N$, or 
$A_i \sim \mathrm{Bin}(P_i, N)$. The variance of $(A_i / N)$ is small for large $N$, and therefore $(A_i / N)$ approaches 
$P_i$. However, the binomial distribution of $A_i$ changes throughout the simulation and iteration 
process. The use of $P_i(A|S)$ to control when items are administered during the simulations causes 
$P_i$ to approach $r$ iteratively for the most popular items (i.e., those that have desirable 
psychometric, content, and other required characteristics), while remaining less than $r$ (i.e., 
approaching a value less than $r$) for less desirable items. 

How fast and which items converge¹ to $r$ (or a value less than $r$) depends somewhat on 
the value of $r$ and its relation to the observed, average item-exposure rate, $(\sum P_i / n)$. Chen, 
Ankenmann, and Spray (1999) showed that, regardless of the pool size, $n$, and fixed CAT test 
length, $k$, the average item-exposure rate of any fixed-length CAT is equal to $(k / n)$. Because the 
target rate, $r$, is considered to be a maximum allowable rate for any single item, $r$ 
must be chosen so that $r > (k / n)$. Chen et al. (1999) further showed that the average test-overlap 
rate, $\bar{T}$, is a function of $P_i$. Specifically, 



$$\bar{T} = \frac{N \sum_{i=1}^{n} P_i^2}{k(N-1)} - \frac{1}{N-1}. \tag{1}$$

By completing the square in equation (1) above, they then showed that

$$\bar{T} = \frac{N}{k(N-1)} \left[ \sum_{i=1}^{n} \left( P_i - \frac{k}{n} \right)^2 + \frac{2k}{n} \sum_{i=1}^{n} P_i - \frac{k^2}{n} \right] - \frac{1}{N-1}.$$

Because $\sum_{i=1}^{n} P_i = k$, this simplifies to

$$\bar{T} = \frac{N}{k(N-1)} \left[ \sum_{i=1}^{n} \left( P_i - \frac{k}{n} \right)^2 + \frac{k^2}{n} \right] - \frac{1}{N-1},$$

which is equivalent to

$$\bar{T} = \frac{N}{N-1} \left[ \frac{n}{k} \mathrm{Var}(P_i) + \frac{k}{n} \right] - \frac{1}{N-1}, \tag{5}$$

where $\mathrm{Var}(P_i) = \frac{1}{n} \sum_{i=1}^{n} (P_i - k/n)^2$.

¹ We note that the term, convergence, as used in this paper, describes the iterative process whereby the rates with 
which items in the pool are administered change after each iteration. Because the sum of these rates must always 
equal the length of the test, $k$, only the variance of these rates can change; it decreases iteratively until it stabilizes. 
Thus, the term does not connote a statistical convergence, say in distribution or probability. 

Because the Chen, et al. (1999) paper was concerned with CAT where N is typically very large, 
they used a large-sample approximation for average test-overlap rate or 

$$\bar{T} = \frac{n}{k} \mathrm{Var}(P_i) + \frac{k}{n}. \tag{6}$$

The average item exposure, $(k / n)$, is also the probability that a given item is included when $k$ items 
are drawn from an item pool of size $n$ randomly without replacement (see Appendix). In fact, Chen et al. (1999) showed that 
when $P_i = (k / n)$ for all $i$, $\bar{T}$ reaches its minimum value of $(k / n)$ (i.e., when the variance of $P_i$ 
is zero, the minimum value of $\bar{T}$ occurs). This suggests that perhaps the target rate, $r$, could be 
set to $(k / n)$ to minimize test overlap. However, because items are selected based on their 
psychometric and other characteristics and are not actually drawn at random, $r = (k / n)$ is not a 
realistic target (Chen et al., 1999). Still, a target value slightly higher than $(k / n)$ might be quite 
realistic and would produce a lower test overlap if this target could be reached by a majority of 
the items during the simulation-iteration process described earlier. 
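
The link between observed item rates and average test overlap can be checked numerically. The sketch below is illustrative only: the function names and the toy forms are assumptions. It computes the average pairwise overlap directly and then again from the observed rates $P_i = A_i / N$, using the finite-$N$ expression $N \sum P_i^2 / [k(N-1)] - 1/(N-1)$.

```python
from itertools import combinations

def overlap_from_forms(forms):
    """Average proportion of shared items over all pairs of forms."""
    k = len(forms[0])
    pairs = list(combinations(forms, 2))
    shared = sum(len(set(a) & set(b)) for a, b in pairs)
    return shared / (k * len(pairs))

def overlap_from_rates(forms, pool_size):
    """The same quantity computed from the observed rates P_i = A_i / N."""
    n_forms, k = len(forms), len(forms[0])
    rates = [sum(i in form for form in forms) / n_forms
             for i in range(pool_size)]
    sum_sq = sum(p * p for p in rates)
    return n_forms * sum_sq / (k * (n_forms - 1)) - 1 / (n_forms - 1)

# Hypothetical toy data: three 3-item forms from a pool of 6 items (ids 0-5).
forms = [[0, 1, 2], [0, 1, 3], [0, 4, 3]]
direct = overlap_from_forms(forms)
via_rates = overlap_from_rates(forms, pool_size=6)
print(direct, via_rates)
```

The two values agree because each item that appears on $A_i$ forms contributes $A_i(A_i - 1)/2$ shared pairs, so the overlap depends on the forms only through the sum of the squared rates.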
Controlling Item Allocation across a Small Number of Test Forms

In the CAT situation, N represents the number of tests that are to be given, or in this case, 
the examinee-population size. However, when multiple test forms are constructed for 
administration via computer at a later time, N represents the number of forms to be assembled. 
In this situation, N may be fairly small. This difference in definition and, hence, size, results in a 
slightly different interpretation of the goal of the Sympson-Hetter procedure. Because N is 
small, $(A_i / N)$ will not converge to $P_i$. However, the behavior of $A_i$ can only be described by its 
probability density function or pdf, $A_i \sim \mathrm{Bin}(P_i, N)$, or 

$$\mathrm{Prob}(A_i = m_i) = \binom{N}{m_i} P_i^{m_i} (1 - P_i)^{N - m_i}. \tag{7}$$



The allocation of $n$ items across $N$ forms is the sum of these pdfs or 

$$\sum_{i=1}^{n} \mathrm{Prob}(A_i = m_i) = \sum_{i=1}^{n} \binom{N}{m_i} P_i^{m_i} (1 - P_i)^{N - m_i}. \tag{8}$$



Likewise, each $P_i$ will not converge closely to the target rate, $r$, when $N$ is small. With only 
$N + 1$ possible values for the estimates of $P_i$ to assume, it is even difficult to obtain a large degree 
of stability of the estimates. However, the variance of the estimates of $P_i$ will stabilize, even 
after a small number of iterations. 

In theory, if we set $r = (k / n)$ we should get the item allocation that one would achieve 
with the random sampling of $k$ items from a pool of $n$ items without replacement. This would 
also lead to the minimum average test-overlap rate, $\bar{T}$, as in the CAT situation. However, once 
again, achieving the minimum test-overlap rate while meeting test-assembly specifications may 
not be possible, and a target that is slightly higher than $(k / n)$ will probably need to be used. 
Except for the size of $N$, the iterative process for CAT and for the assembly of multiple 
test forms is the same. A good stopping rule for the CAT iterations is to stop the process when 
the maximum exposure rate observed in the CAT item pool is “nearly” $r$, where “nearly” must be 
defined. For the multiple forms assembly, $\bar{T}$ can be used to stop the process. We select the item 
allocation that results when $\bar{T}$ is a minimum and all assembly constraints have been satisfied. 
Therefore, a number of iterations is specified arbitrarily and the chosen item allocation across 
forms is the one that produces the minimum value of $\bar{T}$ from these iterations while meeting all 
assembly requirements or constraints. Usually only a few iterations are necessary, as in the CAT 
situation. 
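
The selection rule for multiple forms can be sketched as a simple filter-and-minimize step. The names and the toy iteration results below are hypothetical; in practice the feasibility flag would come from whatever assembly procedure checks the psychometric and content constraints.

```python
from itertools import combinations

def average_overlap(forms):
    """Average proportion of shared items over all pairs of forms."""
    k = len(forms[0])
    pairs = list(combinations(forms, 2))
    return sum(len(set(a) & set(b)) for a, b in pairs) / (k * len(pairs))

def pick_allocation(iterations):
    """iterations: list of (forms, meets_constraints), one entry per iteration.

    Returns (overlap, forms) for the feasible iteration with the smallest
    average overlap, or None if no iteration met the constraints.
    """
    feasible = [(average_overlap(forms), forms)
                for forms, ok in iterations if ok]
    return min(feasible, key=lambda pair: pair[0]) if feasible else None

# Hypothetical results of three iterations, each producing three 2-item forms.
runs = [
    ([[0, 1], [0, 1], [0, 2]], True),    # feasible, heavy overlap
    ([[0, 1], [2, 3], [0, 3]], False),   # low overlap but constraints not met
    ([[0, 1], [1, 2], [0, 3]], True),    # feasible, low overlap
]
best_overlap, best_forms = pick_allocation(runs)
print(best_overlap, best_forms)
```

The second run is skipped even though its overlap is low, because the rule only considers iterations in which all assembly constraints were satisfied.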

Example 

We have illustrated this procedure using a sample pool containing 247 items. Tests were 
constructed to be 75 items in length, and eight test forms were assembled to have the same 
average difficulty level (in terms of number-correct score) and variability (in terms of the 
standard deviation of observed test scores) as a reference form. We used the heuristic procedure 
developed by Swanson and Stocking (1993) using their weighted deviations model or WDM. 
When assembled without item-exposure control², the observed test-overlap rate for the 
construction of eight forms was .41. This meant that, on average, 41% of the items on each form 
were also on another form. The allocation of items without exposure control is given in Table 1 
in the second column. 

² In order to assemble multiple forms without item-exposure control, the first item included on a form is selected 
randomly. Thereafter, items are selected for inclusion based on the WDM criteria. Without random selection for 
the first item, all eight forms would be identical. 

If 75 items were drawn completely at random without replacement from the pool with 
probability $(75 / 247)$ to create eight forms without regard to psychometric requirements, the 
item allocation across the eight forms can be obtained from equation (8) using $P_i = (k / n) = .30$. 
Note that this is also the value of $\bar{T}$. These results appear in the fourth column of Table 1. 
Although unattainable in practice, we used this ideal allocation as a baseline against which to 
compare the item allocation that we achieved following the Sympson-Hetter iterations. 
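
The baseline allocation under simple random sampling can be reproduced approximately with a few lines of code. This sketch assumes that every item has the same inclusion probability $k/n$ and that forms are drawn independently, so the number of forms containing a given item is binomial; the function name is illustrative.

```python
from math import comb

def expected_allocation(n, k, n_forms):
    """Expected number of items on exactly m forms, for m = 0..n_forms,
    when each form is a simple random sample of k items from a pool of n."""
    p = k / n
    return [n * comb(n_forms, m) * p**m * (1 - p)**(n_forms - m)
            for m in range(n_forms + 1)]

# Values from the example: 247 items in the pool, 75 per form, 8 forms.
alloc = expected_allocation(n=247, k=75, n_forms=8)
for m, count in enumerate(alloc):
    print(m, round(count, 1))
```

The expected counts are not integers, so a published allocation that sums to exactly 247 may differ by an item or so from the simple rounding of each value.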

In this example, we increased the value of $r$ on successive computer runs until a value of 
$r = .36$ produced eight forms that met all psychometric constraints and yielded a minimum value 
for $\bar{T}$. These results are given in the third column of Table 1. Thus, our results fell somewhere 
between the item allocation observed with no exposure control (the second column) and the 
random or ideal allocation (the fourth column). The use of the Sympson-Hetter procedure to find 
the item allocation with the smallest average test-overlap rate, $\bar{T}$, with all psychometric 
constraints or requirements satisfied reduced the value of $\bar{T}$ from .41 to .31. The number of 
items that appeared on all eight forms was reduced from 13 to 0, while the number of items that 
never appeared on a single form was reduced from 25 to 14. 

TABLE 1

Item Allocations from the Sample Item Pool

# of Test Forms     Without Item-Exposure   With Item-Exposure      Random Distribution
(m)                 Control                 Control, r = .36        r = (k / n)
                    (# of Items)            (# of Items)            (# of Items)

0                   25                      14                      14
1                   55                      50                      48
2                   74                      75                      73
3                   47                      56                      63
4                   19                      32                      34
5                    9                      17                      12
6                    4                       2                       3
7                    1                       1                       0
8                   13                       0                       0

Test-Overlap Rate   .41                     .31                     .30

Item Allocations and Test Assembly under Content Constraints 

The previous discussion centered on a simple assembly problem in which only 
psychometric constraints had to be met. However, in most multiple-form assembly problems, 
additional conditions or constraints involving content requirements also must be satisfied. In this 
situation there are $J$ content categories, $j = 1, 2, \ldots, J$, so that the item pool of size $n$ is stratified 
into $J$ mutually exclusive partitions of sizes $n_1, n_2, \ldots, n_J$. The test-assembly specifications require that 
$k_1, k_2, \ldots, k_J$ items, respectively, from these content categories appear on each assembled form, in addition 
to psychometric constraints. 

The average test-overlap rate increases with additional content constraints because the 
required number of items must be drawn from smaller pools of size $n_j$ rather than from $n$. 
Therefore, more overlap is expected, especially from those content categories where $k_j$ is large 
relative to $n_j$. We can compute the minimal test-overlap rate, $\bar{T}_{\min}$, that would result if each test 
form were assembled by drawing $k_j$ items randomly from categories of size $n_j$ without 
replacement. Even though the average item-exposure rate will remain equal to $(k / n)$, the 
random sampling would be stratified so that the value of $P_i$ would depend upon the content 
category for that item. For stratified random sampling without replacement, the probability of an 
item being selected from content category $j$ is $(k_j / n_j)$. Thus, from equation (5), the variance of 
$P_i$ would not be zero and $\bar{T}$ would increase. However, the computation of $\bar{T}$ from equation (5) 
under stratified random sampling would still yield a baseline test-overlap rate to use as a 
reference, along with an expected item allocation from equations (7) and (8). 
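
A stratified baseline of this kind can be sketched as follows. The blueprint below is hypothetical (three categories rather than the 37 in the sample pool), and the sketch computes the expected pairwise overlap of independently drawn stratified forms, $\sum_j (k_j^2 / n_j) / k$, rather than a finite-$N$ value from a particular set of forms.

```python
def stratified_baseline(strata):
    """strata: list of (k_j, n_j) pairs, one per content category.

    Returns the per-category inclusion probabilities k_j / n_j and the
    expected test-overlap rate between two independently drawn forms.
    """
    k = sum(k_j for k_j, _ in strata)
    rates = [k_j / n_j for k_j, n_j in strata]
    # Each item in category j is on both forms with probability (k_j/n_j)^2,
    # so the category contributes k_j^2 / n_j expected shared items.
    overlap = sum(k_j * k_j / n_j for k_j, n_j in strata) / k
    return rates, overlap

# Hypothetical blueprint: (items required per form, items available in pool).
strata = [(1, 1), (4, 10), (5, 29)]
rates, t_min = stratified_baseline(strata)
k_total = sum(k_j for k_j, _ in strata)
n_total = sum(n_j for _, n_j in strata)
print(rates, round(t_min, 4), k_total / n_total)
```

In this toy blueprint the single-item category forces one item onto every form, and the stratified baseline overlap exceeds the unstratified value $k/n$, which is the pattern the paragraph above describes.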

In our sample pool, each item was assigned to one of 37 mutually exclusive content categories. 
One of the categories had only a single item represented in the pool. The test specifications 
called for exactly one item from this category; therefore, this item had to 
appear on all eight forms. The expected item allocation across eight forms from stratified 
random sampling appears in Table 2 in the fourth column. The item allocation without exposure 
control appears in the second column of this table. 

Using $r_j = (k_j / n_j)$ as the ideal target, we again experimented by adding a small constant, 
$\delta$, to the ideal and found the smallest value of $\delta$ that would result in a minimal value of $\bar{T}$ and 
still meet all assembly constraints, both psychometric and content³. This value was $\delta = .05$. The 
results showed that this reduced the value of $\bar{T}$ from .49 to .36. 

TABLE 2

Item Allocations from the Sample Item Pool with Content Constraints

# of Test Forms     Without Item-Exposure   With Item-Exposure              Random Distribution
(m)                 Control                 Control, r_j = (k_j / n_j) + .05   r_j = (k_j / n_j)
                    (# of Items)            (# of Items)                    (# of Items)

0                   41                      25                              24
1                   59                      61                              51
2                   57                      50                              61
3                   32                      51                              52
4                   24                      28                              34
5                    9                      20                              17
6                    4                      11                               6
7                    2                       0                               1
8                   19                       1                               1

Test-Overlap Rate   .49                     .36                             .35



³ There is probably an ideal constant, $\delta_j$, for each content category, that would produce a slightly better allocation of 
items. The time required to find $J$ such values, however, may not justify the small benefit in this example. There 
may be other situations in which the determination of $J$ distinct values of $\delta$ would be worthwhile. 
Summary 

Our results indicated that we could control the overall allocation of items across multiple 
test forms assembled via automated assembly methods using the same procedure that is used to 
control for item exposure in CAT situations. The iterative procedure was programmed directly 
into the form-assembly code. Thus, no “pre-assembly” work had to be done, as is done in CAT 
to obtain the values of $P_i(A|S)$ for later testing. In this case the iterations were a part of the 
assembly process, and the goal was to produce the desired item allocation across forms, rather 
than to obtain exposure-control parameters for each item. 
Appendix 

We desire the probability that a given item will be among the $k$ items drawn without replacement from an 
item pool containing $n$ items. The easiest way to approach the problem is to compute the 
probability that the item will not be drawn, even after $k$ attempts. Our desired probability is then 
the complement of this probability. 

The probability that an item will not be drawn on the first attempt is $[(n - 1) / n]$. The 
probability that the item will not be drawn on the second attempt, given that it was not drawn on the 
first, is $[(n - 2) / (n - 1)]$. For the third attempt, it is $[(n - 3) / (n - 2)]$. For the $k$th and last attempt, it is 
$[(n - k) / (n - k + 1)]$. Because each of these probabilities is conditional on the item not having been 
drawn on any earlier attempt, the probability that the item will not be 
drawn after all $k$ attempts is their product, or 

$$\frac{(n-1)}{n} \cdot \frac{(n-2)}{(n-1)} \cdot \frac{(n-3)}{(n-2)} \cdots \frac{(n-k)}{(n-k+1)} = \prod_{i=1}^{k} \frac{(n-i)}{(n-i+1)},$$

which, after cancellation, simplifies to $(n - k) / n$. Therefore, the probability that an item will be 
selected without replacement is $1 - (n - k) / n$ or $(k / n)$. 
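
The telescoping product can be verified with exact rational arithmetic. The function name is illustrative, and the pool size and test length are taken from the example in the paper.

```python
from fractions import Fraction

def prob_item_drawn(n, k):
    """Probability that a given item is among k draws without replacement."""
    not_drawn = Fraction(1)
    for i in range(1, k + 1):
        # Chance of missing the item on draw i, given it was missed so far.
        not_drawn *= Fraction(n - i, n - i + 1)
    return 1 - not_drawn

p = prob_item_drawn(247, 75)
print(p, float(p))
```

Using `Fraction` avoids floating-point rounding, so the result can be compared with $k/n$ exactly.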